Bayesian Logistic Regression Capstone
  • Research
  • Slides
  • About Us

On this page

  • Introduction
    • Aims
  • Method
    • Bayesian Logistic Regression
    • Model Structure
    • Prior Specification
    • Posterior Predictions
    • Model Evaluation and Diagnostics
    • Advantages of Bayesian Logistic Regression
  • Analysis and Results
    • Data Preparation
      • Data Import and Merging
      • Variable Definitions
      • Study Design and Survey-Weighted Analysis
      • Adult Cohort Definition
      • Missing Data Summary
      • Exploratory Data Analysis
    • Modeling Frameworks
      • Survey-Weighted Logistic Regression (Design-Based MLE)
      • Multivariate Imputation by Chained Equations (Pooled Logistic Regression)
      • Bayesian Logistic Regression
    • Results
      • Population-level interpretation (posterior, odds ratios)
    • Discussion and Limitations
      • Interpretation
      • Limitations
    • Conclusion
  • References

Bayesian Logistic Regression for Predicting Diabetes Risk Using NHANES 2013–2014 Data

A Capstone Project on Bayesian Applications in Epidemiologic Modeling

Authors

Namita Mishra

Autumn Wilcox

Published

November 7, 2025

Slides: slides.html (Edit slides.qmd.)

Introduction

Diabetes mellitus (DM) remains a major public health challenge, and identifying key risk factors—such as obesity, age, sex, and race/ethnicity—is essential for prevention and targeted intervention. Logistic regression is widely used to estimate associations between such factors and binary outcomes like diabetes diagnosis. However, classical maximum likelihood estimation (MLE) can produce unstable estimates in the presence of missing data, quasi-separation, or small samples. Bayesian logistic regression offers a robust alternative by integrating prior information, regularizing estimates, and quantifying uncertainty more transparently than frequentist approaches.

Bayesian hierarchical models, implemented via Markov Chain Monte Carlo (MCMC), have been successfully applied in predicting patient health status across diseases such as pneumonia, prostate cancer, and mental disorders (Zeger et al. 2020). By representing predictive uncertainty alongside point estimates, Bayesian inference offers a practical advantage in epidemiologic modeling where decisions hinge on probabilistic thresholds. Beyond stability, Bayesian methods support model checking, variable selection, and uncertainty quantification under missingness or imputation frameworks (Baldwin and Larson 2017; Kruschke and Liddell 2017).

Recent work has expanded Bayesian applications to disease diagnostics and health risk modeling. For instance, Bayesian approaches have been used to evaluate NHANES diagnostic data (Chatzimichail and Hatjimihail 2023), to model cardiovascular and metabolic risk (Liu et al. 2013), and to integrate multiple data modalities such as imaging and laboratory measures (Abdullah, Hassan, and Mustafa 2022). Moreover, multiple imputation combined with Bayesian modeling generates robust estimates when data are missing at random (MAR) or not at random (MNAR) (Austin et al. 2021).

The broader Bayesian literature emphasizes the role of priors and model checking. Weakly informative priors, such as \(N(0, 2.5)\) for coefficients, regularize estimation and reduce variance in small samples (Gelman et al. 2008; Vande Schoot et al. 2021). Tutorials using R packages like brms and blavaan illustrate how MCMC enables posterior inference and empirical Bayes analysis (Klauenberg et al. 2015).

Beyond standard generalized linear models, Bayesian nonparametric regression flexibly captures nonlinearity and zero inflation common in health data (Richardson and Hartman 2018). Bayesian Additive Regression Trees (BART) improve variable selection in mixed-type data (Luo et al. 2024), while state-space and dynamic Bayesian models incorporate time-varying biomarkers for longitudinal prediction (Momeni et al. 2021). Bayesian model averaging (BMA) further addresses model uncertainty by weighting across multiple specifications (Hoeting et al. 1999). Together, these approaches demonstrate the versatility and growing importance of Bayesian inference in clinical and epidemiologic modeling.

The objective of this project is to evaluate whether Bayesian inference provides more stable and interpretable estimates of diabetes risk than frequentist and imputation-based approaches, particularly when data complexity or separation challenges arise. Agreement across modeling frameworks supports the robustness of these associations and highlights the interpretability and uncertainty quantification advantages offered by Bayesian analysis in population health modeling (National Center for Health Statistics (NCHS) 2014).

Aims

The present study employs Bayesian logistic regression to predict diabetes status and examine the relationships between diabetes and key predictors, including body mass index (BMI), age (≥20 years), sex, and race. Using retrospective data from the 2013–2014 NHANES survey, the analysis accounts for the study’s complex sampling design, which involves stratification, clustering, and the oversampling of specific subpopulations rather than simple random sampling. The Bayesian framework is applied to address common analytical challenges such as missing data, complete case bias, and data separation, thereby improving the robustness and reliability of inference compared to traditional logistic regression methods.

Method

Bayesian Logistic Regression

The Bayesian framework integrates prior knowledge with observed data to generate posterior distributions, allowing parameters to be interpreted directly in probabilistic terms.

Unlike traditional frequentist approaches that yield single-point estimates and p-values, Bayesian methods represent parameters as random variables with full probability distributions.

This provides greater flexibility, incorporates parameter uncertainty, and produces credible intervals that directly quantify the probability that a parameter lies within a given range.

Model Structure

Bayesian logistic regression models the log-odds of a binary outcome as a linear combination of predictors:

\[ \text{logit}(P(Y = 1)) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k \]

where

  • \(P(Y = 1)\) is the probability of the event of interest,
  • \(\beta_0\) is the intercept (log-odds when all predictors are zero), and
  • \(\beta_j\) represents the effect of predictor \(X_j\) on the log-odds of the outcome, holding other predictors constant.

In the Bayesian framework, model parameters (\(\boldsymbol{\beta}\)) are treated as random variables and assigned prior distributions that reflect existing knowledge or plausible ranges before observing the data. After incorporating the observed evidence, the priors are updated through Bayes’ theorem (Leeuw and Klugkist 2012; Klauenberg et al. 2015):

\[ \text{Posterior} \propto \text{Likelihood} \times \text{Prior} \]

  • Likelihood: represents the probability of the observed data given the model parameters—it captures how well different parameter values explain the data.
  • Prior: expresses beliefs or existing information about the parameters before observing the data.
  • Posterior: combines both, representing the updated distribution of parameter values after accounting for the data.

This formulation allows uncertainty to propagate naturally through the model, producing posterior distributions for each coefficient that can be directly interpreted as probabilities.

Prior Specification

Weakly informative priors were used to regularize estimation without imposing strong assumptions:

  • Regression Coefficients: \(N(0, 2.5)\), providing gentle regularization while allowing substantial variation in plausible effects (Gelman et al. 2008; Vande Schoot et al. 2021).
  • Intercept: Student’s t-distribution prior, \(t(3, 0, 10)\) (Schoot et al. 2013; Vande Schoot et al. 2021), which has
    • 3 degrees of freedom (heavy tails to allow occasional large effects),
    • mean 0 (no bias toward positive or negative effects), and
    • scale 10 (broad range of possible values).

Such priors help stabilize estimation in the presence of multicollinearity, limited sample size, or potential outliers.

Posterior Predictions

Posterior distributions of regression coefficients were used to estimate the probability of the outcome for given predictor values. This allows statements such as: > Given the predictors, the probability of the outcome lies between X% and Y%.

Posterior predictions account for two key sources of uncertainty:

  1. Parameter Uncertainty: Variability in estimated model coefficients.
  2. Predictive Uncertainty: Variability in possible future outcomes given those parameters.

In Bayesian analysis, all unknown quantities—coefficients, means, variances, or probabilities—are treated as random variables described by their posterior distributions.

Model Evaluation and Diagnostics

Model quality and convergence were assessed using standard Bayesian diagnostics:

  • Posterior Sampling: Conducted via Markov Chain Monte Carlo (MCMC) using the No-U-Turn Sampler (NUTS), a variant of Hamiltonian Monte Carlo (HMC) (Austin et al. 2021). Four chains were run with sufficient warm-up iterations to ensure convergence.
  • Convergence Metrics: The potential scale reduction factor (\(\hat{R}\)) and effective sample size (ESS) were used to verify stability and mixing across chains.
  • Autocorrelation Checks: Ensured independence between successive draws.
  • Posterior Predictive Checks (PPCs): Compared simulated outcomes to observed data to evaluate fit.
  • Bayesian \(R^2\): Quantified the proportion of variance explained by predictors, incorporating posterior uncertainty.

Advantages of Bayesian Logistic Regression

  • Uncertainty Quantification: Produces full posterior distributions instead of single estimates.
  • Credible Intervals: Provide the range within which a parameter lies with a specified probability (e.g., 95%).
  • Flexible Priors: Allow integration of expert knowledge or findings from prior studies.
  • Probabilistic Predictions: Posterior predictive distributions yield direct probabilities for new or future observations.
  • Model Evaluation: PPCs assess how well simulated outcomes reproduce observed data.

Analysis and Results

Data Preparation

This study used publicly available 2013–2014 NHANES data published by the CDC’s National Center for Health Statistics (National Center for Health Statistics (NCHS) 2014). Three component files were utilized: DEMO_H (demographics), BMX_H (body measures), and DIQ_H (diabetes questionnaire). Each file was imported in .XPT format using the haven package in R, and merged using the unique participant identifier SEQN to create a single adult analytic dataset (age ≥ 20 years).

All variables were coerced to consistent numeric or factor types prior to merging to ensure atomic columns suitable for survey-weighted analysis and modeling. The use of SEQN preserved respondent integrity across datasets and ensured accurate record linkage. This preprocessing step standardized variable formats and minimized inconsistencies between files.

Data wrangling, cleaning, and merging were performed in R using a combination of base functions and tidyverse packages. Bayesian logistic regression modeling was later implemented using the brms interface to Stan, allowing probabilistic inference within a reproducible workflow that accommodated the NHANES complex survey design and missing data considerations.

Data Import and Merging

Code
merged_data <- readRDS("data/merged_2013_2014.rds")

merged_n <- nrow(merged_data)

The merged dataset contains 10,175 participants. It integrates demographic, examination, and diabetes questionnaire data. We then restrict the sample to adults (age ≥ 20) to define the analytic cohort used in subsequent analyses. A small proportion of records have missing values in BMI and diabetes status, which will be addressed later through multiple imputation.

Preview of merged NHANES 2013–2014 dataset limited to analysis variables (source columns only).
RIDAGEYR BMXBMI RIAGENDR RIDRETH1 DIQ010
69 26.7 1 4 1
54 28.6 1 3 1
72 28.9 1 3 1
9 17.1 1 3 2
73 19.7 2 3 2
56 41.7 1 1 2
0 NA 1 3 NA
61 35.7 2 3 2
42 NA 1 2 2
56 26.5 2 3 2

Variable Definitions

  • Response Variable:
    diabetes_dx (binary) represents a Type 2 diabetes diagnosis, excluding gestational diabetes. It was derived from DIQ010 (“Doctor told you have diabetes”), while DIQ050 (insulin use) was excluded to prevent treatment-related confounding.

  • Predictor Variables:

    • BMXBMI – Body Mass Index (kg/m^2), treated as continuous and categorized into six BMI classes (bmi_cat).
    • RIDAGEYR – Age (continuous, 20–80 years)
    • RIAGENDR – Sex (factor, two levels)
    • RIDRETH1 – Ethnicity (factor, five levels)
Code
# -----------------------------
# Variable descriptions (with `code` formatting for names)
# -----------------------------
var_tbl <- tribble(
  ~Variable,      ~Description,                                                                                                   ~Type,         ~Origin,
  "`diabetes_dx`","Type 2 diabetes diagnosis (1 = Yes, 0 = No) derived from `DIQ010`; gestational diabetes excluded.",           "Categorical", "Derived from `DIQ010`",
  "`age`",        "Age in years.",                                                                                                "Continuous",  "NHANES `RIDAGEYR`",
  "`bmi`",        "Body Mass Index (kg/m^2) computed from measured height and weight.",                                           "Continuous",  "NHANES `BMXBMI`",
  "`bmi_cat`",    "BMI categories: Underweight, Normal, Overweight, Obesity I–III (`Normal` is reference in models).",            "Categorical", "Derived from `bmi`",
  "`sex`",        "Sex of participant (`Male`, `Female`).",                                                                       "Categorical", "NHANES `RIAGENDR`",
  "`race`", "race/Ethnicity collapsed to four levels: White, Black, Hispanic, Other.", "Categorical", "Derived from `RIDRETH1`",
  "`WTMEC2YR`",   "Examination sample weight for Mobile Examination Center participants.",                                        "Weight",      "NHANES design",
  "`SDMVPSU`",    "Primary Sampling Unit used for variance estimation in the complex survey design.",                             "Design",      "NHANES design",
  "`SDMVSTRA`",   "Stratum identifier used to define strata for the complex survey design.",                                      "Design",      "NHANES design",
  "`age_c`",      "Centered and standardized age (z-score).",                                                                     "Continuous",  "Derived from `age`",
  "`bmi_c`",      "Centered and standardized BMI (z-score).",                                                                     "Continuous",  "Derived from `bmi`"
)

kbl(
  var_tbl,
  caption = "Variable Descriptions: Adult Analytic Dataset",
  align = c("l","l","l","l"),
  escape = FALSE
) %>%
  kable_styling(full_width = FALSE, position = "center", bootstrap_options = c("striped","hover")) %>%
  group_rows("Analysis variables", 1, 6) %>%              # <-- updated range (now 6 analysis rows)
  group_rows("Survey design variables", 7, 9) %>%
  group_rows("Derived variables", 10, 11)
Variable Descriptions: Adult Analytic Dataset
Variable Description Type Origin
Analysis variables
`diabetes_dx` Type 2 diabetes diagnosis (1 = Yes, 0 = No) derived from `DIQ010`; gestational diabetes excluded. Categorical Derived from `DIQ010`
`age` Age in years. Continuous NHANES `RIDAGEYR`
`bmi` Body Mass Index (kg/m^2) computed from measured height and weight. Continuous NHANES `BMXBMI`
`bmi_cat` BMI categories: Underweight, Normal, Overweight, Obesity I–III (`Normal` is reference in models). Categorical Derived from `bmi`
`sex` Sex of participant (`Male`, `Female`). Categorical NHANES `RIAGENDR`
`race` race/Ethnicity collapsed to four levels: White, Black, Hispanic, Other. Categorical Derived from `RIDRETH1`
Survey design variables
`WTMEC2YR` Examination sample weight for Mobile Examination Center participants. Weight NHANES design
`SDMVPSU` Primary Sampling Unit used for variance estimation in the complex survey design. Design NHANES design
`SDMVSTRA` Stratum identifier used to define strata for the complex survey design. Design NHANES design
Derived variables
`age_c` Centered and standardized age (z-score). Continuous Derived from `age`
`bmi_c` Centered and standardized BMI (z-score). Continuous Derived from `bmi`

Study Design and Survey-Weighted Analysis

The National Health and Nutrition Examination Survey (NHANES) employs a complex, multistage probability sampling design with stratification, clustering, and oversampling of specific demographic groups (for example, racial/ethnic minorities and older adults) to produce nationally representative estimates of the U.S. population.

Survey design variables — primary sampling units (SDMVPSU), strata (SDMVSTRA), and examination sample weights (WTMEC2YR) — were retained to account for this complex design. These variables were applied to adjust for unequal probabilities of selection, nonresponse, and oversampling, ensuring valid standard errors, unbiased prevalence estimates, and generalizable population-level inference.

A survey-weighted logistic regression model was used to evaluate the association between diabetes status (diabetes_dx, binary outcome) and key predictors: body mass index (bmi), age (age), sex (sex), and race/ethnicity (race). Diabetes was defined using DIQ010 (“Doctor told you have diabetes”) and coded as 0/1, with DIQ050 (insulin use) excluded to avoid treatment-related confounding.

Covariates included:
- age (continuous; centered as age_c, categorized 20–80 years)
- bmi (continuous; centered as bmi_c, and categorized by BMI class bmi_cat)
- sex (male, female)
- race (four ethnicity levels: White, Black, Hispanic, Other)

This approach accounts for NHANES’ complex sampling design, producing unbiased parameter estimates and valid inference for U.S. adults.

Step Description
Weighting Used the survey package to calculate weighted means for key variables (e.g., age and diabetes status) and to estimate design effects and effective sample size for the complex survey design.
Standardization Centered and standardized BMI and age (bmi_c, age_c) for use in regression models.
Age Categorization Not implemented in the analytic dataset (continuous age retained). Reference retained for potential descriptive grouping (20–<30, 30–<40, 40–<50, 50–<60, 60–<70, 70–80).
BMI Categorization Recoded as: <18.5 (Underweight), 18.5–<25 (Normal), 25–<30 (Overweight), 30–<35 (Obesity I), 35–<40 (Obesity II), ≥40 (Obesity III).
Ethnicity Recoding RIDRETH1 recoded as: 1 = Mexican American, 2 = Other Hispanic, 3 = Non-Hispanic White, 4 = Non-Hispanic Black, 5 = Other/Multi; then NH White set as the reference level (five analytical levels retained).
Special Codes Transformed nonresponse codes (e.g., 3, 7, 9) to NA. These missing codes were evaluated for potential nonrandom patterns (MAR/MNAR).
Missing Data Retained and visualized missing values (primarily in BMI and diabetes status) to assess their pattern and informativeness before multiple imputation.
Final Dataset Created the cleaned analytic dataset (adult) using Non-Hispanic White and Male as reference groups for modeling, preserving NHANES survey design variables (WTMEC2YR, SDMVPSU, SDMVSTRA).

Adult Cohort Definition

Code
# NHANES survey design object for the adult analytic cohort

nhanes_design_adult <- survey::svydesign(
id      = ~SDMVPSU,
strata  = ~SDMVSTRA,
weights = ~WTMEC2YR,
nest    = TRUE,
data    = adult
)

# Quick weighted checks

survey::svymean(~age, nhanes_design_adult, na.rm = TRUE)
      mean     SE
age 47.496 0.3805
Code
survey::svymean(~diabetes_dx, nhanes_design_adult, na.rm = TRUE)
                mean     SE
diabetes_dx 0.089016 0.0048
Code
# Design effect and effective sample size for `diabetes_dx`

v_hat <- as.numeric(survey::svyvar(~diabetes_dx, nhanes_design_adult, na.rm = TRUE))
p_hat <- mean(adult$diabetes_dx, na.rm = TRUE)
n_obs <- nrow(adult)
v_srs <- p_hat * (1 - p_hat) / n_obs
deff  <- v_hat / v_srs

n_total <- sum(weights(nhanes_design_adult), na.rm = TRUE)
ess     <- as.numeric(n_total / deff)

cat("Design effect for diabetes_dx:", round(deff, 2), "\n")
Design effect for diabetes_dx: 4759.91 
Code
cat("Effective sample size for diabetes_dx:", round(ess), "\n")
Effective sample size for diabetes_dx: 48142 

Descriptive statistics for continuous and categorical variables are presented below.

Code
# Keep only analytic variables for Table 1
tbl1_dat <- adult %>%
  transmute(
    age,
    bmi,
    bmi_cat,
    sex,
    race,
    diabetes_dx = factor(diabetes_dx, levels = c(0, 1), labels = c("No", "Yes"))
  )

# Continuous summaries: N, missing, mean, sd, min, max
cont_vars <- c("age", "bmi")

cont_sum <- tbl1_dat %>%
  select(all_of(cont_vars)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "value") %>%
  group_by(Variable) %>%
  summarise(
    N       = sum(!is.na(value)),
    Missing = sum(is.na(value)),
    Mean    = round(mean(value, na.rm = TRUE), 2),
    SD      = round(sd(value, na.rm = TRUE), 2),
    Min     = round(min(value, na.rm = TRUE), 1),
    Max     = round(max(value, na.rm = TRUE), 1),
    .groups = "drop"
  )

# Categorical summaries: counts and percents
cat_vars <- c("sex", "race", "diabetes_dx", "bmi_cat")

cat_sum <- tbl1_dat %>%
  mutate(across(all_of(cat_vars),
                ~ forcats::fct_explicit_na(as.factor(.x), na_level = "(Missing)"))) %>%
  select(all_of(cat_vars)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Level") %>%
  count(Variable, Level, name = "n") %>%
  group_by(Variable) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  ungroup() %>%
  arrange(Variable, desc(n))

# Render tables
kable(cont_sum,
      caption = "Table 1a. Continuous variables (age, BMI): N, missing, mean (SD), range.") %>%
  kable_styling(full_width = FALSE)
Table 1a. Continuous variables (age, BMI): N, missing, mean (SD), range.
Variable N Missing Mean SD Min Max
age 5769 0 49.11 17.56 20.0 80.0
bmi 5520 249 29.10 7.15 14.1 82.9
Code
kable(cat_sum,
      caption = "Table 1b. Categorical variables (sex, race, diabetes_dx, bmi_cat): counts and percentages.") %>%
  kable_styling(full_width = FALSE)
Table 1b. Categorical variables (sex, race, diabetes_dx, bmi_cat): counts and percentages.
Variable Level n pct
bmi_cat 25–<30 1768 30.6
bmi_cat 18.5–<25 1579 27.4
bmi_cat 30–<35 1145 19.8
bmi_cat 35–<40 519 9.0
bmi_cat ≥40 419 7.3
bmi_cat (Missing) 249 4.3
bmi_cat <18.5 90 1.6
diabetes_dx No 4974 86.2
diabetes_dx Yes 618 10.7
diabetes_dx (Missing) 177 3.1
race NH White 2472 42.8
race NH Black 1177 20.4
race Other/Multi 845 14.6
race Mexican American 767 13.3
race Other Hispanic 508 8.8
sex Female 3011 52.2
sex Male 2758 47.8

Table 1a and 1b summarize the analytic variables included in subsequent models. Mean age and BMI values indicate an adult cohort spanning a wide range of body composition, while categorical summaries confirm balanced sex representation and sufficient sample sizes across race/ethnicity categories. These variables were standardized and used as predictors in all modeling frameworks (analytic cohort N = 5,769 adults ≥ 20 years).

Code
adult_n <- nrow(adult)
Table 1: Excerpt of the cleaned NHANES 2013–2014 adult cohort (age ≥ 20; N = 5,769) with derived and standardized variables.
SDMVPSU SDMVSTRA WTMEC2YR diabetes_dx bmi age sex race DIQ050 age_c bmi_c bmi_cat
1 112 13481.04 1 26.7 69 Male NH Black 1 1.1324183 -0.3358861 25–<30
1 108 24471.77 1 28.6 54 Male NH White 1 0.2783598 -0.0702810 25–<30
1 109 57193.29 1 28.9 72 Male NH White 1 1.3032300 -0.0283434 25–<30
2 116 65541.87 0 19.7 73 Female NH White 2 1.3601672 -1.3144311 18.5–<25
1 111 25344.99 0 41.7 56 Male Mexican American 2 0.3922343 1.7609961 ≥40
1 114 61758.65 0 35.7 61 Female NH White 2 0.6769204 0.9222432 35–<40

As shown in Table 1, the analytic adult cohort (N = 5,769) includes standardized variables for age and BMI (age_c, bmi_c), categorical indicators for sex and race/ethnicity (race), and a binary doctor-diagnosed diabetes variable (diabetes_dx).

Code
# Textual structure and preview
str(adult)
'data.frame':   5769 obs. of  12 variables:
 $ SDMVPSU    : num  1 1 1 2 1 1 2 1 2 2 ...
 $ SDMVSTRA   : num  112 108 109 116 111 114 106 112 112 113 ...
 $ WTMEC2YR   : num  13481 24472 57193 65542 25345 ...
 $ diabetes_dx: num  1 1 1 0 0 0 0 0 0 0 ...
 $ bmi        : num  26.7 28.6 28.9 19.7 41.7 35.7 NA 26.5 22 20.3 ...
 $ age        : num  69 54 72 73 56 61 42 56 65 26 ...
 $ sex        : Factor w/ 2 levels "Male","Female": 1 1 1 2 1 2 1 2 1 2 ...
 $ race       : Factor w/ 5 levels "NH White","Mexican American",..: 4 1 1 1 2 1 3 1 1 1 ...
 $ DIQ050     : num  1 1 1 2 2 2 2 2 2 2 ...
 $ age_c      : num  1.132 0.278 1.303 1.36 0.392 ...
 $ bmi_c      : num  -0.3359 -0.0703 -0.0283 -1.3144 1.761 ...
 $ bmi_cat    : Factor w/ 6 levels "<18.5","18.5–<25",..: 3 3 3 2 6 5 NA 3 2 2 ...
Code
# Visual structure and type overview
plot_intro(adult, title = "Adult Dataset: Variable Types and Completeness")

Missing Data Summary

Code
# Visualize missing data pattern
plot_missing(adult, title = "Missing Data Pattern (Adult Dataset)")

Missing data were minimal across analytic variables. BMI-related fields (bmi, bmi_c, bmi_cat) showed ~4.3% missingness, and diabetes_dx showed ~3.1%. All demographic and survey design variables were complete, indicating that missingness was limited to health-related measures and appropriate for multiple imputation.
Code
# Summarize missingness for key analysis variables

miss_tbl <- tibble::tibble(
Variable    = c("bmi", "diabetes_dx"),
Missing_n   = c(sum(is.na(adult_eda$bmi)), sum(is.na(adult_eda$diabetes_dx))),
Missing_pct = round(c(mean(is.na(adult_eda$bmi)), mean(is.na(adult_eda$diabetes_dx))) * 100, 1)
)

knitr::kable(
miss_tbl,
caption = "Missingness for Key Analysis Variables."
)
Missingness for Key Analysis Variables.
Variable Missing_n Missing_pct
bmi 249 4.3
diabetes_dx 177 3.1

Overall missingness was low (~7.3%). Gaps were concentrated in bmi (n = 249) and diabetes_dx (n = 177), while demographic and design variables were complete. This limited pattern of missingness is consistent with a Missing At Random (MAR) mechanism and likely reflects reduced participation in the physical examination component among certain adults.

Code
# Sanity checks for missingness

colSums(is.na(adult_eda))
    SDMVPSU    SDMVSTRA    WTMEC2YR diabetes_dx         bmi         age 
          0           0           0         177         249           0 
        sex        race       age_c       bmi_c     bmi_cat 
          0           0           0         249           0 
Code
sapply(adult_eda[, c("diabetes_dx", "bmi", "bmi_c", "age")], function(x) sum(is.na(x)))
diabetes_dx         bmi       bmi_c         age 
        177         249         249           0 

Exploratory Data Analysis

Following the missing data assessment, exploratory analyses were conducted to describe the adult analytic cohort and visualize distributions across key demographic and health variables. The goal was to examine univariate patterns and bivariate relationships relevant to diabetes prevalence before modeling.

The adult analytic cohort was broadly representative of the U.S. population, with a majority identifying as Non-Hispanic White. Age and BMI distributions were right-skewed, with most participants classified as overweight or obese. Visual exploration revealed a clear positive relationship between age, BMI, and diabetes prevalence. Non-Hispanic Black and Hispanic participants exhibited higher proportions of diabetes compared to Non-Hispanic Whites.

Approximately 25% of variables were categorical (for example, sex, race, diabetes_dx) and 75% were continuous (age, bmi, age_c, bmi_c), indicating that the dataset primarily consisted of measured numeric values such as BMI and age. About 93% of rows contained complete information across all predictors and outcomes, reflecting high data quality.

Adult age ranged from 20 to 80 years, with peak representation between 30 and 50 years and a slight right skew toward older ages. BMI was concentrated in the overweight and obese ranges. Female participants were slightly over represented relative to Male participants.

Code
# Age distribution (analytic adult)
ggplot(adult, aes(x = age)) +
  geom_histogram(binwidth = 5, color = "white") +
  labs(title = "Distribution of Age (≥20 years)", x = "Age (years)", y = "Count") +
  theme_minimal()

Code
# Diabetes outcome distribution
ggplot(adult, aes(x = factor(diabetes_dx, levels = c(0,1), labels = c("No","Yes")))) +
  geom_bar() +
  labs(title = "Diabetes Outcome Distribution (≥20 years)", x = "Diabetes (No/Yes)", y = "Count") +
  theme_minimal()

Code
# BMI category distribution
ggplot(adult, aes(x = bmi_cat)) +
  geom_bar(color = "white", fill = "skyblue") +
  labs(title = "Distribution of BMI Categories (≥20 years)", x = "BMI Category", y = "Count") +
  theme_minimal()

Code
# BMI by diabetes outcome (boxplot)
# (You can’t use boxplot with categorical y, so revert to numeric BMI here)
ggplot(adult, aes(x = factor(diabetes_dx, levels = c(0,1), labels = c("No","Yes")), y = bmi)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "BMI by Diabetes Diagnosis (≥20 years)", x = "Diabetes (No/Yes)", y = "BMI (numeric)") +
  theme_minimal()

Code
# Diabetes by race (dodged bars)
ggplot(adult, aes(x = race, fill = factor(diabetes_dx, levels = c(0,1), labels = c("No","Yes")))) +
  geom_bar(position = "dodge") +
  labs(title = "Diabetes Diagnosis by race/Ethnicity (≥20 years)",
       x = "race/Ethnicity (race)", y = "Count", fill = "Diabetes") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The EDA missingness summary shows approximately 4.3% missing BMI and 3.1% missing diabetes diagnosis (diabetes_dx). All design variables (WTMEC2YR, SDMVPSU, SDMVSTRA), as well as age, sex, and race, are complete—sex and race NAs are encoded as explicit “(Missing)” levels in the EDA view.

Modeling Frameworks

Three modeling frameworks were compared using identical predictors (standardized age, BMI, sex, and race) and the binary outcome diabetes_dx: (1) survey-weighted logistic regression to incorporate the NHANES complex sampling design, (2) multiple imputation (MICE) to address missing BMI values, and (3) Bayesian logistic regression with weakly informative priors to quantify uncertainty.

Survey-Weighted Logistic Regression (Design-Based MLE)

The NHANES 2013–2014 data use a complex, multistage probability design involving strata (SDMVSTRA), primary sampling units (PSUs; SDMVPSU), and examination weights (WTMEC2YR) to ensure nationally representative estimates (National Center for Health Statistics (NCHS) 2014).

Estimates are population-weighted using NHANES survey design variables (WTMEC2YR, SDMVSTRA, SDMVPSU). Odds ratios are reported per one standard deviation (1 SD) increase in age and BMI, with reference groups Male and White.

Code
adult_clean <- adult %>%
  dplyr::mutate(
    sex   = forcats::fct_drop(sex),
    race = forcats::fct_drop(race),
    age_c = as.numeric(age_c),
    bmi_c = as.numeric(bmi_c)
  ) %>%
  dplyr::filter(
    !is.na(diabetes_dx),
    !is.na(age_c),
    !is.na(bmi_c),
    !is.na(sex),
    !is.na(race)
  )

# Structure and sanity checks
str(adult_clean[, c("diabetes_dx","sex","race","age_c","bmi_c")])
'data.frame':   5349 obs. of  5 variables:
 $ diabetes_dx: num  1 1 1 0 0 0 0 0 0 1 ...
 $ sex        : Factor w/ 2 levels "Male","Female": 1 1 1 2 1 2 2 1 2 1 ...
 $ race       : Factor w/ 5 levels "NH White","Mexican American",..: 4 1 1 1 2 1 1 1 1 1 ...
 $ age_c      : num  1.132 0.278 1.303 1.36 0.392 ...
 $ bmi_c      : num  -0.3359 -0.0703 -0.0283 -1.3144 1.761 ...
Code
table(adult_clean$sex)

  Male Female 
  2551   2798 
Code
table(adult_clean$race)

        NH White Mexican American   Other Hispanic         NH Black 
            2293              713              470             1101 
     Other/Multi 
             772 
Code
table(adult_clean$diabetes_dx)

   0    1 
4752  597 
Code
options(survey.lonely.psu = "adjust")

nhanes_design_adult <- survey::svydesign(
  id      = ~SDMVPSU,
  strata  = ~SDMVSTRA,
  weights = ~WTMEC2YR,
  nest    = TRUE,
  data    = adult_clean
)

svy_fit <- survey::svyglm(
  diabetes_dx ~ age_c + bmi_c + sex + race,
  design = nhanes_design_adult,
  family = quasibinomial()
)

svy_or <- broom::tidy(svy_fit, conf.int = TRUE) %>%
  dplyr::mutate(
    OR  = exp(estimate),
    LCL = exp(conf.low),
    UCL = exp(conf.high)
  ) %>%
  dplyr::select(term, OR, LCL, UCL, p.value) %>%
  dplyr::filter(term != "(Intercept)")

Design-based odds ratios are summarized in Table 2.

Code
knitr::kable(svy_or)
Table 2: Survey-weighted logistic regression: odds ratios (OR) and 95% confidence intervals for diabetes diagnosis among adults (NHANES 2013–2014).
term OR LCL UCL p.value
age_c 3.0292807 2.6967690 3.4027912 0.0000000
bmi_c 1.8853571 1.6526296 2.1508579 0.0000039
sexFemale 0.5281132 0.4104905 0.6794397 0.0003857
raceMexican American 2.0358434 1.4850041 2.7910081 0.0008262
raceOther Hispanic 1.5915182 1.1664529 2.1714810 0.0087119
raceNH Black 1.6689718 1.1605895 2.4000450 0.0116773
raceOther/Multi 2.3270527 1.5451752 3.5045697 0.0014331

Interpretation:
Age and BMI are the strongest predictors of diabetes in the NHANES 2013–2014 adult sample, with each 1 SD increase in age nearly tripling diabetes risk and higher BMI contributing substantially to elevated odds. Males show significantly higher odds of diabetes compared to females, consistent with known sex differences in metabolic risk. Racial and ethnic disparities are evident, with Hispanic, Black, and Other/Multi-racial adults all exhibiting significantly higher odds of diabetes than Non-Hispanic Whites. All predictors are statistically significant (p < 0.05), indicating robust associations across demographic and health characteristics.

Multivariate Imputation by Chained Equations (Pooled Logistic Regression)

Multiple Imputation by Chained Equations (MICE) was used as a principled approach for handling missing data (Stef van Buuren and Groothuis-Oudshoorn 2011; S. van Buuren 2012).
MICE iteratively imputes each incomplete variable using regression models based on other variables in the dataset, producing multiple completed datasets that reflect uncertainty due to missingness. Estimates are then pooled across imputations using Rubin’s rules to generate final parameter estimates and confidence intervals.

MICE, as an alternative to the Bayesian approach, effectively manages missing data through chained regression equations without requiring full joint modeling of all variables.

For large sample sizes (n ≥ 400), even in the presence of high percentages (up to 75%) of missing data in one variable, non-normal distributions such as flat densities, heavy tails, skewness, and multimodality do not materially affect mean structure estimation performance (S. van Buuren 2012).

In this study, continuous variables were imputed using regression-based methods: age via normal linear regression (norm) and BMI via predictive mean matching (PMM) to better preserve the empirical BMI distribution. Categorical variables (sex and race) were imputed using logistic and polytomous regression models, respectively. Diabetes status (diabetes_dx) was treated as an outcome variable and was not imputed. Twenty imputations were generated to reduce Monte Carlo error and maintain robust variance estimation.

Convergence and Data Stability
The chained equation process showed stable convergence across iterations, indicating reliable estimation of missing BMI (and, where present, age) values. After MICE, the final imputed dataset consisted of n = 5,592 adults with all key predictor variables completed.

Code
library(mice)
library(dplyr)

# 1. Build data for imputation
mi_dat <- adult %>%
  dplyr::select(
    diabetes_dx, age, bmi, sex, race,
    WTMEC2YR, SDMVPSU, SDMVSTRA
  )

# 2. Set up MICE methods and predictor matrix
meth <- mice::make.method(mi_dat)
pred <- mice::make.predictorMatrix(mi_dat)

# Do NOT impute the outcome; use it as a predictor only
meth["diabetes_dx"] <- ""
pred["diabetes_dx", ] <- 0     # outcome predicts nobody
pred[, "diabetes_dx"]  <- 1    # but everyone is predicted BY diabetes_dx

# Imputation methods (same logic as Namita)
meth["age"]  <- "norm"      # age: normal model
meth["bmi"]  <- "pmm"       # bmi: predictive mean matching
meth["sex"]  <- "polyreg"   # sex: polytomous logistic
meth["race"] <- "polyreg"   # race: polytomous logistic

# Design variables as auxiliaries only (not imputed)
meth[c("WTMEC2YR","SDMVPSU","SDMVSTRA")] <- ""
pred[, c("WTMEC2YR","SDMVPSU","SDMVSTRA")] <- 1

# 3. Run imputation (m=5 like Namita)
imp <- mice::mice(
  mi_dat,
  m               = 5,
  method          = meth,
  predictorMatrix = pred,
  seed            = 123
)

 iter imp variable
  1   1  bmi
  1   2  bmi
  1   3  bmi
  1   4  bmi
  1   5  bmi
  2   1  bmi
  2   2  bmi
  2   3  bmi
  2   4  bmi
  2   5  bmi
  3   1  bmi
  3   2  bmi
  3   3  bmi
  3   4  bmi
  3   5  bmi
  4   1  bmi
  4   2  bmi
  4   3  bmi
  4   4  bmi
  4   5  bmi
  5   1  bmi
  5   2  bmi
  5   3  bmi
  5   4  bmi
  5   5  bmi
Code
# 4. Fit logistic regression within each imputed dataset
fit_mi <- with(
  imp,
  {
    age_c <- as.numeric(scale(age))
    bmi_c <- as.numeric(scale(bmi))
    glm(
      diabetes_dx ~ age_c + bmi_c + sex + race,
      family = binomial()
    )
  }
)

# 5. Pool estimates
pool_mi <- mice::pool(fit_mi)

# 6. Create pooled OR table (this is what tbl-mice needs)
mi_or <- summary(pool_mi, conf.int = TRUE, exponentiate = TRUE) %>%
  dplyr::rename(
    OR  = estimate,
    LCL = `2.5 %`,
    UCL = `97.5 %`
  ) %>%
  dplyr::filter(term != "(Intercept)")
Code
adult_imp1 <- mice::complete(imp, 1) %>%
  dplyr::mutate(
    age_c  = as.numeric(scale(age)),
    bmi_c  = as.numeric(scale(bmi)),
    wt_norm = WTMEC2YR / mean(WTMEC2YR, na.rm = TRUE),
    race = forcats::fct_relevel(race, "NH White"),
    sex  = forcats::fct_relevel(sex,  "Male")
  ) %>%
  dplyr::filter(
    !is.na(diabetes_dx),
    !is.na(age_c),
    !is.na(bmi_c),
    !is.na(sex),
    !is.na(race)
  ) %>%
  droplevels()

glimpse(adult_imp1)
Rows: 5,592
Columns: 11
$ diabetes_dx <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ age         <dbl> 69, 54, 72, 73, 56, 61, 42, 56, 65, 26, 76, 33, 32, 38, 50…
$ bmi         <dbl> 26.7, 28.6, 28.9, 19.7, 41.7, 35.7, 23.6, 26.5, 22.0, 20.3…
$ sex         <fct> Male, Male, Male, Female, Male, Female, Male, Female, Male…
$ race        <fct> NH Black, NH White, NH White, NH White, Mexican American, …
$ WTMEC2YR    <dbl> 13481.04, 24471.77, 57193.29, 65541.87, 25344.99, 61758.65…
$ SDMVPSU     <dbl> 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2…
$ SDMVSTRA    <dbl> 112, 108, 109, 116, 111, 114, 106, 112, 112, 113, 116, 114…
$ age_c       <dbl> 1.13241831, 0.27835981, 1.30323001, 1.36016725, 0.39223428…
$ bmi_c       <dbl> -0.33319172, -0.06755778, -0.02561558, -1.31184309, 1.7639…
$ wt_norm     <dbl> 0.3393916, 0.6160884, 1.4398681, 1.6500477, 0.6380722, 1.5…

Descriptive Results (Imputed Dataset)

After imputation, the analytic dataset contained approximately 5,500–5,600 adults. The mean age was around 49 years (SD ≈ 17), and the mean BMI was approximately 29 (SD ≈ 7). Females represented roughly 52% of participants, and the majority identified as Non-Hispanic White (~43%). Diabetes prevalence was about 11%, consistent with population-level NHANES estimates.

Code
# Correlation heatmap

correlation_matrix <- cor(adult_imp1[, c("diabetes_dx", "age", "bmi")], use = "complete.obs", method = "pearson")
correlation_melted <- melt(correlation_matrix)

ggplot(correlation_melted, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Correlation Heatmap: Diabetes, Age, and BMI")

Code
# Diabetes diagnosis distribution

ggplot(adult_imp1, aes(x = factor(diabetes_dx))) +
geom_bar(fill = "steelblue") +
labs(title = "Diabetes Diagnosis Distribution", x = "Diabetes (0 = No, 1 = Yes)", y = "Count") +
theme_minimal()

Code
# BMI distribution by diabetes status

ggplot(adult_imp1, aes(x = factor(diabetes_dx), y = bmi, fill = factor(diabetes_dx))) +
geom_boxplot(alpha = 0.7) +
scale_x_discrete(labels = c("0" = "No Diabetes", "1" = "Diabetes")) +
labs(x = "Diabetes Diagnosis", y = "BMI", title = "BMI Distribution by Diabetes Status") +
theme_minimal() +
theme(legend.position = "none")

Code
# Predicted probability of diabetes vs BMI

ggplot(adult_imp1, aes(x = bmi, y = diabetes_dx)) +
geom_point(alpha = 0.2, position = position_jitter(height = 0.02)) +
geom_smooth(method = "glm", method.args = list(family = "binomial"), se = TRUE, color = "blue") +
labs(x = "BMI", y = "Probability of Diabetes", title = "Predicted Probability of Diabetes vs BMI") +
theme_minimal()

Code
miss_age  <- sum(is.na(mi_dat$age))
miss_bmiN <- sum(is.na(mi_dat$bmi))

mi_caption <- if (miss_age > 0 && miss_bmiN > 0) {
"Multiple Imputation (MICE): pooled odds ratios (OR) and 95% confidence intervals after imputing missing age (normal) and BMI (PMM) (m = 20); diabetes status was not imputed."
} else if (miss_bmiN > 0) {
"Multiple Imputation (MICE): pooled odds ratios (OR) and 95% confidence intervals after imputing missing BMI (PMM) (m = 20); diabetes status was not imputed."
} else if (miss_age > 0) {
"Multiple Imputation (MICE): pooled odds ratios (OR) and 95% confidence intervals after imputing missing age (normal) (m = 20); diabetes status was not imputed."
} else {
"Multiple Imputation (MICE): pooled odds ratios (OR) and 95% confidence intervals (no variables required imputation); diabetes status was not imputed."
}
mi_caption <- paste0(mi_caption, " Odds ratios are per 1 SD for age and BMI.")

knitr::kable(mi_or, caption = mi_caption)
Table 3: Multiple Imputation (MICE): pooled odds ratios (OR) and 95% confidence intervals after imputing missing BMI (PMM) (m = 20); diabetes status was not imputed. Odds ratios are per 1 SD for age and BMI.
term OR std.error statistic df p.value LCL UCL conf.low conf.high
2 age_c 2.9038183 0.0559473 19.054108 5520.446 0.0000000 2.6021752 3.2404277 2.6021752 3.2404277
3 bmi_c 1.7278084 0.0447339 12.224604 5148.557 0.0000000 1.5827382 1.8861754 1.5827382 1.8861754
4 sexFemale 0.5391132 0.0937913 -6.587282 5551.660 0.0000000 0.4485669 0.6479368 0.4485669 0.6479368
5 raceMexican American 2.4296216 0.1375046 6.456041 5472.583 0.0000000 1.8555327 3.1813298 1.8555327 3.1813298
6 raceOther Hispanic 1.7518320 0.1748554 3.206433 5573.987 0.0013515 1.2434346 2.4680953 1.2434346 2.4680953
7 raceNH Black 1.9757793 0.1198118 5.683602 5576.734 0.0000000 1.5621842 2.4988753 1.5621842 2.4988753
8 raceOther/Multi 2.1120110 0.1530066 4.886328 4749.963 0.0000011 1.5646727 2.8508138 1.5646727 2.8508138

Interpretation

  • Age and BMI are strong positive predictors of diabetes; each 1 SD increase substantially raises the odds.

  • Sex: Females show significantly lower odds of diabetes compared to males.

  • race/Ethnicity: All non-White racial/ethnic groups have higher odds of diabetes compared to Non-Hispanic Whites, underscoring persistent disparities in diabetes risk.

  • Model significance: All predictors are statistically significant (p < 0.05).

  • Results are consistent with those from the survey-weighted model, confirming robustness to imputation.

Bayesian Logistic Regression

Bayesian logistic regression was used to quantify parameter uncertainty and compare posterior estimates with the survey-weighted and MICE models. Weakly informative priors were used to regularize estimates while preserving flexibility in inference.

Model Specifications: - Family: Bernoulli with logit link
- Data: adult_imp1 (N = 5,592)
- Chains: 4 (2,000 iterations each; 1,000 warmup)
- Adaptation delta: 0.95
- Weights: normalized NHANES exam weights (wt_norm, mean ≈ 1.00, SD ≈ 0.79)
- Predictors: standardized age, BMI, sex, and race

Define Model and Priors

Code
fml_bayes <- diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race

priors <- c(
  brms::set_prior("normal(0, 2.5)", class = "b"),
  brms::set_prior("student_t(3, 0, 10)", class = "Intercept")
)
Code
adult_long <- adult_imp1 %>%
dplyr::select(bmi_c, age_c) %>%
tidyr::pivot_longer(
cols = dplyr::everything(),
names_to = "Coefficient",
values_to = "Value"
)

ggplot2::ggplot(adult_long, ggplot2::aes(x = Value, fill = Coefficient)) +
ggplot2::geom_density(alpha = 0.5) +
ggplot2::theme_minimal() +
ggplot2::labs(
title = "Distributions for Standardized Age and BMI (adult_imp1)",
x = "Standardized value (z-score)",
y = "Density",
fill = "Coefficient"
)

Distributions for standardized BMI and Age (adult_imp1).
Code
prior_draws <- tibble::tibble(
term = rep(c("Age (per 1 SD)", "BMI (per 1 SD)"), each = 4000),
value = c(
stats::rnorm(4000, mean = 0, sd = 2.5),
stats::rnorm(4000, mean = 0, sd = 2.5)
)
)

ggplot2::ggplot(prior_draws, ggplot2::aes(x = value, fill = term)) +
ggplot2::geom_density(alpha = 0.5) +
ggplot2::theme_minimal() +
ggplot2::labs(
title = "Prior Distributions for Age and BMI Coefficients",
x = "Coefficient value",
y = "Density",
fill = NULL
)

Prior distributions for Age and BMI coefficients (Normal(0, 2.5)).

Fit the Model

Code
library(brms)

priors <- c(
  brms::set_prior("normal(0, 2.5)", class = "b"),
  brms::set_prior("student_t(3, 0, 10)", class = "Intercept")
)

bayes_fit <- brms::brm(
  formula = diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race,
  data    = adult_imp1,
  family  = bernoulli(link = "logit"),
  prior   = priors,
  chains  = 4, iter = 2000, seed = 123,
  control = list(adapt_delta = 0.95),
  refresh = 0
)
Running MCMC with 4 sequential chains...

Chain 1 finished in 11.3 seconds.
Chain 2 finished in 10.5 seconds.
Chain 3 finished in 10.6 seconds.
Chain 4 finished in 11.2 seconds.

All 4 chains finished successfully.
Mean chain execution time: 10.9 seconds.
Total execution time: 43.8 seconds.
Code
summary(bayes_fit)
 Family: bernoulli 
  Links: mu = logit 
Formula: diabetes_dx | weights(wt_norm) ~ age_c + bmi_c + sex + race 
   Data: adult_imp1 (Number of observations: 5592) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept              -2.66      0.09    -2.83    -2.50 1.00     3548     3512
age_c                   1.10      0.06     0.98     1.22 1.00     2349     2618
bmi_c                   0.63      0.05     0.54     0.72 1.00     3327     2826
sexFemale              -0.66      0.10    -0.86    -0.47 1.00     3668     3124
raceMexicanAmerican     0.69      0.17     0.34     1.03 1.00     3657     2821
raceOtherHispanic       0.43      0.25    -0.07     0.89 1.00     4242     3014
raceNHBlack             0.53      0.15     0.23     0.83 1.00     3809     3012
raceOtherDMulti         0.81      0.19     0.45     1.18 1.00     3948     2809

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Posterior Odd Ratios (Main Results)

Code
knitr::kable(
dplyr::mutate(bayes_or, dplyr::across(c(OR, LCL, UCL), ~ round(.x, 2)))
)
Table 4: Bayesian logistic regression: posterior odds ratios (OR) with 95% credible intervals.
term Estimate Est.Error Q2.5 Q97.5 OR LCL UCL
Intercept -2.6633187 0.0868613 -2.8341138 -2.4958967 0.07 0.06 0.08
age_c 1.0968784 0.0618886 0.9783744 1.2200119 2.99 2.66 3.39
bmi_c 0.6282273 0.0467939 0.5366821 0.7199012 1.87 1.71 2.05
sexFemale -0.6624742 0.1034594 -0.8645869 -0.4660003 0.52 0.42 0.63
raceMexicanAmerican 0.6898163 0.1710160 0.3432716 1.0298163 1.99 1.41 2.80
raceOtherHispanic 0.4252184 0.2458586 -0.0669575 0.8870126 1.53 0.94 2.43
raceNHBlack 0.5307334 0.1524774 0.2283617 0.8328511 1.70 1.26 2.30
raceOtherDMulti 0.8143883 0.1876762 0.4467512 1.1763335 2.26 1.56 3.24

Age and BMI show strong positive associations with diabetes (credible intervals exclude 1).

Female sex shows lower odds than male (protective factor).

Non-White racial groups have higher odds compared with Whites, consistent with known disparities.

All model parameters exhibit well-defined, unimodal posteriors with narrow credible intervals.

Diagnostics and Model Fit

Code
knitr::kable(as.data.frame(brms::bayes_R2(bayes_fit)))
Table 5: Bayesian R² summary.
Estimate Est.Error Q2.5 Q97.5
R2 0.1316278 0.0123417 0.107432 0.1565549
Code
diag <- posterior::summarise_draws(bayes_fit, "rhat", "ess_bulk", "ess_tail")

diag_b <- diag |>
dplyr::as_tibble() |>
dplyr::filter(grepl("^b_", .data$variable)) |>
dplyr::transmute(
Parameter = .data$variable,
Rhat      = .data$rhat,
Bulk_ESS  = .data$ess_bulk,
Tail_ESS  = .data$ess_tail
)

knitr::kable(diag_b, digits = 1)
Table 6: MCMC diagnostics (R-hat and Effective Sample Sizes) for model parameters.
Parameter Rhat Bulk_ESS Tail_ESS
b_Intercept 1 3548.0 3511.8
b_age_c 1 2349.3 2617.8
b_bmi_c 1 3327.1 2825.9
b_sexFemale 1 3668.1 3123.7
b_raceMexicanAmerican 1 3656.6 2821.2
b_raceOtherHispanic 1 4242.3 3013.5
b_raceNHBlack 1 3809.1 3012.2
b_raceOtherDMulti 1 3947.9 2809.1

All parameters achieved R̂ ≈ 1.00 and effective sample sizes >2,000, indicating excellent convergence. The Bayesian R² ≈ 0.13, showing that age, BMI, sex, and race explain about 13% of diabetes variability.

Model Comparison

Code
invisible(capture.output({
fit_no_race <- update(bayes_fit, formula = update(fml_bayes, . ~ . - race))
fit_no_sex  <- update(bayes_fit, formula = update(fml_bayes, . ~ . - sex))
}))

loo_base    <- loo::loo(bayes_fit)
loo_no_race <- loo::loo(fit_no_race)
loo_no_sex  <- loo::loo(fit_no_sex)

cmp_df <- as.data.frame(loo::loo_compare(loo_base, loo_no_race, loo_no_sex))
cmp_df$Model <- rownames(cmp_df)
cmp_df <- cmp_df[, c("Model", setdiff(names(cmp_df), "Model"))]

knitr::kable(
cmp_df,
caption = "LOO comparison (higher elpd_loo indicates better predictive performance)."
)
Bayesian model comparison (LOO): base model vs. reduced models without race or sex.
Model elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
bayes_fit bayes_fit 0.00000 0.000000 -1418.258 56.42097 8.732434 0.5944729 2836.517 112.8419
fit_no_race fit_no_race -14.43171 6.367627 -1432.690 53.98749 5.223838 0.3831466 2865.380 107.9750
fit_no_sex fit_no_sex -20.04611 8.205833 -1438.305 57.31024 7.359525 0.5226182 2876.609 114.6205

Models excluding race or sex had lower expected log predictive density (elpd), confirming that both variables contribute meaningfully to model fit.

Posterior Predictive Checks

Code
yobs <- adult_imp1$diabetes_dx
Code
bayesplot::pp_check(bayes_fit, type = "bars", nsamples = 100)
Figure 1: Posterior predictive check: observed vs. replicated outcome distribution (bars).
Code
yrep <- brms::posterior_predict(bayes_fit, ndraws = 400)
bayesplot::ppc_stat(y = yobs, yrep = yrep, stat = "mean")
Figure 2: Posterior predictive checks for mean of the binary outcome.
Code
yrep <- brms::posterior_predict(bayes_fit, ndraws = 400)
bayesplot::ppc_stat(y = yobs, yrep = yrep, stat = "sd")
Figure 3: Posterior predictive checks for standard deviation of the binary outcome.

Posterior predictive checks show excellent model calibration: simulated means and standard deviations closely match observed data.

MCMC Diagnostics and Posterior Distributions

Code
bayesplot::mcmc_areas(as.array(bayes_fit), regex_pars = "^b_", prob = 0.95)
Figure 4: Posterior distributions (95% credible mass) for slope parameters.
Code
bayesplot::mcmc_trace(as.array(bayes_fit), regex_pars = "^b_")
Figure 5: Trace plots for slope parameters (chain mixing and stationarity).
Code
post_array <- posterior::as_draws_array(bayes_fit)
bayesplot::mcmc_acf(post_array, pars = c("b_age_c", "b_bmi_c"))
Figure 6: Autocorrelation plots for posterior samples of age and BMI coefficients (MCMC diagnostics).

Trace, density, and autocorrelation plots confirm smooth mixing, unimodal posteriors, and minimal autocorrelation across chains.

All four chains showed strong convergence with no signs of divergence or non-stationarity.
Trace plots revealed stable, overlapping chains with consistent mixing across iterations.
Autocorrelation decayed rapidly toward zero, confirming efficient sampling and low dependency between successive draws.
Together with R̂ ≈ 1.00 and high effective sample sizes, these diagnostics indicate a well-behaved posterior and reliable inference.

Prior vs. Posterior

Code
src_obj <- if (exists("bayes_fit_prior") && !is.null(bayes_fit_prior)) bayes_fit_prior else bayes_fit
draws_names <- names(posterior::as_draws_df(src_obj))
sort(grep("^(b_|prior_b_)", draws_names, value = TRUE))
[1] "b_age_c"               "b_bmi_c"               "b_Intercept"          
[4] "b_raceMexicanAmerican" "b_raceNHBlack"         "b_raceOtherDMulti"    
[7] "b_raceOtherHispanic"   "b_sexFemale"          
Code
src_obj <- if (exists("bayes_fit_prior") && !is.null(bayes_fit_prior)) bayes_fit_prior else bayes_fit
draws_df <- posterior::as_draws_df(src_obj)
all_cols <- names(draws_df)

want_post <- c("b_age_c","b_bmi_c","b_sexFemale","b_raceBlack","b_raceHispanic","b_raceOther")

have_post  <- intersect(want_post, all_cols)
have_prior <- intersect(paste0("prior_", have_post), all_cols)

pairs <- data.frame(post = have_post, prior = paste0("prior_", have_post),
stringsAsFactors = FALSE)
pairs <- pairs[pairs$prior %in% have_prior, , drop = FALSE]

if (nrow(pairs) == 0) {
knitr::asis_output("**No matching prior/posterior parameters found to overlay.**")
} else {
post_long <- tidyr::pivot_longer(
draws_df[, pairs$post, drop = FALSE],
cols = tidyselect::everything(), names_to = "term", values_to = "estimate"
)
post_long$type <- "Posterior"

prior_tmp <- draws_df[, pairs$prior, drop = FALSE]
colnames(prior_tmp) <- pairs$post
prior_long <- tidyr::pivot_longer(
prior_tmp,
cols = tidyselect::everything(), names_to = "term", values_to = "estimate"
)
prior_long$type <- "Prior"

combined_draws <<- dplyr::bind_rows(prior_long, post_long)

lbl <- c(
b_age_c = "Age (1 SD)", b_bmi_c = "BMI (1 SD)",
b_sexFemale = "Female vs Male",
b_raceBlack = "Black vs White",
b_raceHispanic = "Hispanic vs White",
b_raceOther = "Other vs White"
)
combined_draws$term <- factor(
combined_draws$term,
levels = intersect(names(lbl), unique(combined_draws$term)),
labels = lbl[intersect(names(lbl), unique(combined_draws$term))]
)

ggplot2::ggplot(combined_draws, ggplot2::aes(x = estimate, linetype = type)) +
ggplot2::geom_density() +
ggplot2::facet_wrap(~ term, scales = "free", ncol = 2) +
ggplot2::labs(x = "Coefficient (log-odds)", y = "Density", linetype = NULL) +
ggplot2::theme_minimal()
}

No matching prior/posterior parameters found to overlay.

Figure 7: Prior (dashed) vs posterior (solid) densities for selected coefficients.
Code
if (exists("combined_draws") && is.data.frame(combined_draws) && nrow(combined_draws) > 0) {
ggplot(combined_draws, aes(x = estimate, fill = type)) +
geom_density(alpha = 0.4) +
facet_wrap(~ term, scales = "free", ncol = 2) +
theme_minimal(base_size = 13) +
labs(
title = "Prior vs Posterior Distributions",
x = "Coefficient estimate",
y = "Density",
fill = ""
)
} else {
knitr::asis_output("**Skipped: no matching prior/posterior draws to plot.**")
}

Skipped: no matching prior/posterior draws to plot.

Figure 8: Prior vs Posterior Distributions (ggplot2 version).

For age and BMI, the posterior densities shift notably away from the N(0, 2.5) prior toward positive values and are narrower, indicating strong information from the data; for sex, the posterior remains closer to the prior with more overlap, indicating weaker evidence.

The overlay of prior and posterior densities illustrates that informative updates occurred primarily for BMI, age, and race coefficients, which showed distinct posterior shifts relative to the priors. In contrast, weaker predictors such as sex displayed overlapping distributions, indicating that inference for those parameters was more influenced by prior uncertainty than by the observed data. This balance confirms appropriate regularization rather than overfitting.

Model Fit and Calibration

Code
pred_mean <- colMeans(brms::posterior_epred(bayes_fit))
ggplot(data.frame(pred = pred_mean, obs = yobs),
aes(x = pred, y = obs)) +
geom_point(alpha = 0.15, position = position_jitter(height = 0.03)) +
geom_smooth(method = "loess", se = TRUE) +
labs(x = "Mean predicted probability", y = "Observed diabetes (0/1)")
Figure 9: Observed outcome vs. mean predicted probability (calibration scatter with smoother).
Code
# 1. Survey-weighted prevalence
svy_mean <- svymean(~diabetes_dx, nhanes_design_adult, na.rm = TRUE)

# 2. Posterior predictive prevalence (per draw)
pp_samples <- brms::posterior_predict(bayes_fit, ndraws = 1000)  # draws x individuals
pp_proportion <- rowMeans(pp_samples)                            # prevalence per draw

# 3. Build comparison table
summary_table <- tibble(
  Method = c("Survey-weighted mean (NHANES)", 
             "Imputed dataset mean", 
             "Posterior predictive mean"),
  diabetes_mean = c(
    coef(svy_mean),                           # survey-weighted mean
    mean(adult_imp1$diabetes_dx, na.rm = TRUE),  # imputed dataset
    mean(pp_proportion)                       # posterior predictive mean
  ),
  SE = c(
    SE(svy_mean),   # survey-weighted SE
    NA,             # not available for raw mean
    NA              # not available for posterior predictive mean
  )
)

kable(summary_table, digits = 4,
      caption = "Comparison of Diabetes Prevalence Across Methods")
Comparison of Diabetes Prevalence Across Methods
Method diabetes_mean SE
Survey-weighted mean (NHANES) 0.0889 0.0048
Imputed dataset mean 0.1105 NA
Posterior predictive mean 0.1095 NA
Figure 10: Posterior predictive distribution of diabetes prevalence compared to observed NHANES prevalence.

The posterior predictive distribution of diabetes prevalence closely mirrored the survey-estimated prevalence, with the posterior mean aligning within about 1% of the observed rate.

Code
# Posterior predictive prevalence (replicated datasets)

yrep <- brms::posterior_predict(bayes_fit, ndraws = 2000)   # draws x observations (0/1)
post_prev <- rowMeans(yrep)                                 # prevalence each posterior draw

# Survey-weighted observed prevalence (population estimate)

des_obs <- survey::svydesign(
id = ~SDMVPSU, strata = ~SDMVSTRA, weights = ~WTMEC2YR,
nest = TRUE, data = adult_imp1
)
obs <- survey::svymean(~diabetes_dx, des_obs, na.rm = TRUE)
obs_prev  <- as.numeric(obs["diabetes_dx"])
obs_se    <- as.numeric(SE(obs)["diabetes_dx"])
obs_lcl   <- max(0, obs_prev - 1.96 * obs_se)
obs_ucl   <- min(1, obs_prev + 1.96 * obs_se)

# Plot: posterior density with weighted point estimate and 95% CI band

ggplot(data.frame(prev = post_prev), aes(x = prev)) +
geom_density(alpha = 0.6) +
annotate("rect", xmin = obs_lcl, xmax = obs_ucl, ymin = 0, ymax = Inf, alpha = 0.15) +
geom_vline(xintercept = obs_prev, linetype = 2) +
coord_cartesian(xlim = c(0, 1)) +
labs(x = "Diabetes prevalence", y = "Posterior density",
subtitle = sprintf("Survey-weighted NHANES prevalence = %.1f%% (95%% CI %.1f–%.1f%%)",
100*obs_prev, 100*obs_lcl, 100*obs_ucl)) +
theme_minimal()
Figure 11: Population (NHANES survey-weighted) vs posterior predictive diabetes prevalence.
Code
# Posterior predictive draws for the outcome

pp_samples <- brms::posterior_predict(bayes_fit, ndraws = 1000)  # draws x individuals
pp_proportion <- rowMeans(pp_samples)                            # prevalence per draw

pp_proportion_df <- tibble::tibble(proportion = pp_proportion)

ggplot2::ggplot(pp_proportion_df, ggplot2::aes(x = proportion)) +
ggplot2::geom_histogram(binwidth = 0.01, color = "black") +
ggplot2::theme_minimal() +
ggplot2::labs(
title = "Posterior Distribution of Proportion of Diabetes = 1",
x = "Proportion with diabetes",
y = "Frequency"
)

Posterior distribution of proportion of Diabetes = 1.
Code
# Survey-weighted prevalence (already computed earlier as `obs`)

obs_prev <- as.numeric(obs["diabetes_dx"])
obs_se   <- as.numeric(survey::SE(obs)["diabetes_dx"])

summary_table <- tibble::tibble(
Method = c(
"Survey-weighted mean (NHANES)",
"Imputed dataset mean (adult_imp1)",
"Posterior predictive mean (Bayesian)"
),
diabetes_mean = c(
obs_prev,
mean(adult_imp1$diabetes_dx, na.rm = TRUE),
mean(pp_proportion)
),
SE = c(
obs_se,
NA_real_,
NA_real_
)
)

knitr::kable(
summary_table,
digits = 4,
caption = "Comparison of Diabetes Prevalence Across Methods"
)
Comparison of Diabetes Prevalence Across Methods.
Method diabetes_mean SE
Survey-weighted mean (NHANES) 0.0890 NA
Imputed dataset mean (adult_imp1) 0.1105 NA
Posterior predictive mean (Bayesian) 0.1093 NA

Internal Validation: Individual-Level Predictions

Code
adult_means <- adult_imp1 %>% summarise(
age_mean = mean(age, na.rm = TRUE),
age_sd   = sd(age, na.rm = TRUE),
bmi_mean = mean(bmi, na.rm = TRUE),
bmi_sd   = sd(bmi, na.rm = TRUE)
)

to_model_row <- function(age_raw, bmi_raw, sex_lab, race_lab) {
tibble(
age_c  = (age_raw - adult_means$age_mean)/adult_means$age_sd,
bmi_c  = (bmi_raw - adult_means$bmi_mean)/adult_means$bmi_sd,
sex    = factor(sex_lab,   levels = levels(adult_imp1$sex)),
race  = factor(race_lab, levels = levels(adult_imp1$race)),
wt_norm = 1
)
}

plot_post_density <- function(df_row, title_txt) {
phat <- posterior_linpred(bayes_fit, newdata = df_row, transform = TRUE)
ci95 <- quantile(phat, c(0.025, 0.975))
ggplot(data.frame(pred = as.numeric(phat)), aes(x = pred)) +
geom_density(fill = "skyblue", alpha = 0.4) +
geom_vline(xintercept = ci95[1], linetype = "dashed", color = "red") +
geom_vline(xintercept = ci95[2], linetype = "dashed", color = "red") +
labs(x = "P(Diabetes = 1)", y = "Density", title = title_txt) +
theme_minimal()
}

p1 <- to_model_row(adult$age[1], adult$bmi[1],
as.character(adult$sex[1]), as.character(adult$race[1]))
plot_post_density(p1, "Participant 1: Posterior Predictive Distribution (95% CrI)")

Posterior predictive distributions for example participants.

Posterior predictive densities for individual participants illustrate uncertainty in diabetes risk estimates. Credible intervals quantify plausible risk ranges for each profile.

Posterior Predictions and Inverse Inference

Code
library(dplyr)
library(ggplot2)

# 1. Grid of BMI values (RAW BMI from 18 to 40)
bmi_seq <- seq(18, 40, by = 0.5)

# 2. Newdata using the SAME factor levels as adult_imp1
newdata_grid <- data.frame(
  age_c  = 40,   # NOTE: Namita used 40 here even though age_c is standardized
  bmi_c  = bmi_seq,   # she also used raw BMI in a column named bmi_c
  sex    = factor("Female",          levels = levels(adult_imp1$sex)),
  race   = factor("Mexican American", levels = levels(adult_imp1$race)),
  wt_norm = 1
)

# 3. Posterior predicted probabilities
pred_probs <- brms::posterior_linpred(
  bayes_fit,
  newdata   = newdata_grid,
  transform = TRUE
)

# 4. Mean predicted probability at each BMI
prob_mean <- colMeans(pred_probs)

pred_df <- dplyr::bind_cols(newdata_grid, prob_mean = prob_mean)

# 5. Target probability
target_prob <- 0.30

# Find the BMI whose predicted prob is closest to the target
closest <- pred_df[which.min(abs(pred_df$prob_mean - target_prob)), , drop = FALSE]

# 6. Plot
ggplot(pred_df, aes(x = bmi_c, y = prob_mean)) +
  geom_line(color = "darkblue", linewidth = 1.2) +
  geom_hline(yintercept = target_prob, color = "red", linetype = "dashed") +
  geom_vline(xintercept = closest$bmi_c, color = "red", linetype = "dotted") +
  annotate(
    "text",
    x     = closest$bmi_c,
    y     = target_prob + 0.05,
    label = paste0("Target BMI \u2248 ", round(closest$bmi_c, 1)),
    color = "red",
    hjust = -0.1
  ) +
  labs(
    x = "BMI (kg/m^2)",
    y = "Predicted Probability of Diabetes",
    title = "Inverse Prediction: BMI Needed for Target Diabetes Risk"
  ) +
  coord_cartesian(ylim = c(0, 1)) +
  theme_bw()

Inverse Prediction: BMI Needed for Target Diabetes Risk

Results

A concise summary of posterior estimates is provided below.

Code
cat(paste(bullets, collapse = "\n"))

Population-level interpretation (posterior, odds ratios)

  • Convergence. All R-hat ≈ 1.00; large ESS → excellent mixing.
  • Baseline risk. Male, White, mean age/BMI: ~6.5% predicted diabetes prevalence.
  • Age. +1 SD → 2.99× (95% CrI 2.66–3.39; CrI excludes 1).
  • BMI. +1 SD → 1.87× (95% CrI 1.71–2.05; CrI excludes 1).
  • Female vs. Male. 0.52× (95% CrI 0.42–0.63; CrI excludes 1).
  • Black vs. White. NA× (95% CrI NA–NA; CrI overlaps 1).
  • Hispanic vs. White. NA× (95% CrI NA–NA; CrI overlaps 1).
  • Other/Multi vs. White. NA× (95% CrI NA–NA; CrI overlaps 1).
Code
# Combine results from all three models

svy_tbl   <- if (exists("svy_or") && nrow(svy_or) > 0)
dplyr::mutate(svy_or,   Model = "Survey-weighted (MLE)") else NULL
mi_tbl    <- if (exists("mi_or") && nrow(mi_or) > 0)
dplyr::mutate(mi_or,    Model = "MICE Pooled") else NULL
bayes_tbl <- if (exists("bayes_or") && nrow(bayes_or) > 0)
dplyr::mutate(bayes_or, Model = "Bayesian") %>%
dplyr::filter(term != "Intercept") else NULL

all_tbl <- dplyr::bind_rows(Filter(Negate(is.null), list(svy_tbl, mi_tbl, bayes_tbl))) %>%
dplyr::mutate(
term = dplyr::case_when(
  grepl("bmi", term,  ignore.case = TRUE) ~ "BMI (per 1 SD)",
  grepl("age", term,  ignore.case = TRUE) ~ "Age (per 1 SD)",
  grepl("^sexFemale$", term)              ~ "Female (vs. Male)",
  grepl("^sexMale$", term)                ~ "Male (vs. Female)",
  grepl("^raceHispanic$", term)          ~ "Hispanic (vs. White)",
  grepl("^raceBlack$", term)             ~ "Black (vs. White)",
  grepl("^raceOther$", term)             ~ "Other (vs. White)",
  TRUE ~ term
),
OR_CI = sprintf("%.2f (%.2f – %.2f)", OR, LCL, UCL)
) %>%
dplyr::select(Model, term, OR_CI)
Code
knitr::kable(all_tbl, align = c("l","l","c"))
Table 7: Comparison of odds ratios (per 1 SD for age and BMI) and 95% intervals across survey-weighted, MICE, and Bayesian frameworks.
Model term OR_CI
Survey-weighted (MLE) Age (per 1 SD) 3.03 (2.70 – 3.40)
Survey-weighted (MLE) BMI (per 1 SD) 1.89 (1.65 – 2.15)
Survey-weighted (MLE) Female (vs. Male) 0.53 (0.41 – 0.68)
Survey-weighted (MLE) raceMexican American 2.04 (1.49 – 2.79)
Survey-weighted (MLE) raceOther Hispanic 1.59 (1.17 – 2.17)
Survey-weighted (MLE) raceNH Black 1.67 (1.16 – 2.40)
Survey-weighted (MLE) raceOther/Multi 2.33 (1.55 – 3.50)
MICE Pooled Age (per 1 SD) 2.90 (2.60 – 3.24)
MICE Pooled BMI (per 1 SD) 1.73 (1.58 – 1.89)
MICE Pooled Female (vs. Male) 0.54 (0.45 – 0.65)
MICE Pooled raceMexican American 2.43 (1.86 – 3.18)
MICE Pooled raceOther Hispanic 1.75 (1.24 – 2.47)
MICE Pooled raceNH Black 1.98 (1.56 – 2.50)
MICE Pooled raceOther/Multi 2.11 (1.56 – 2.85)
Bayesian Age (per 1 SD) 2.99 (2.66 – 3.39)
Bayesian BMI (per 1 SD) 1.87 (1.71 – 2.05)
Bayesian Female (vs. Male) 0.52 (0.42 – 0.63)
Bayesian raceMexicanAmerican 1.99 (1.41 – 2.80)
Bayesian raceOtherHispanic 1.53 (0.94 – 2.43)
Bayesian raceNHBlack 1.70 (1.26 – 2.30)
Bayesian raceOtherDMulti 2.26 (1.56 – 3.24)

Across all three frameworks—survey-weighted (MLE), multiple imputation, and Bayesian—age and BMI were consistently associated with higher odds of doctor-diagnosed diabetes. Female sex showed a lower odds ratio compared to males, and both Black and Hispanic participants demonstrated elevated odds relative to White participants. The similarity of effect sizes across frameworks underscores the robustness of these predictors to different modeling assumptions and missing-data treatments.

Discussion and Limitations

Interpretation

The Bayesian logistic regression framework produced results that were highly consistent with both the survey-weighted and MICE-pooled frequentist models. Age and BMI remained the most influential predictors of doctor-diagnosed diabetes, each showing a strong and positive association with diabetes risk.

Unlike classical maximum likelihood estimation, the Bayesian approach directly quantified uncertainty through posterior distributions, offering richer interpretability and more transparent probability statements. The alignment between Bayesian and design-based estimates supports the robustness of these associations and highlights the practicality of Bayesian modeling for complex, weighted population data.

Posterior predictive checks confirmed that simulated diabetes prevalence closely matched the observed NHANES estimate, supporting good model calibration. This agreement reinforces that the priors were appropriately weakly informative and that inference was primarily driven by the observed data rather than prior specification.

Overall, this study demonstrates that Bayesian inference complements traditional epidemiologic methods by maintaining interpretability while enhancing stability and explicitly quantifying uncertainty in complex survey data.

Limitations

While this analysis demonstrates the value of Bayesian logistic regression for epidemiologic modeling, several limitations should be acknowledged.

First, the use of a single imputed dataset for the Bayesian model—rather than full joint modeling of imputation uncertainty—may understate total variance.

Second, NHANES exam weights were normalized and treated as importance weights, which approximate but do not fully reproduce design-based inference.

Third, the weakly informative priors \(N(0, 2.5)\) for slopes and Student-t(3, 0, 10) for the intercept were not empirically tuned; alternative prior specifications could slightly alter posterior intervals.

Finally, although convergence diagnostics (R̂ ≈ 1, sufficient effective sample sizes, and stable posterior predictive checks) indicated good model performance, results are conditional on the 2013–2014 NHANES cycle and may not generalize to later datasets or longitudinal analyses.

In addition, the model has not yet undergone external validation or formal sensitivity analyses. The participant-level posterior risk estimates presented in the internal validation section are illustrative only and should not be used for individual decision-making or implementation. Before deployment or use for imputation in other settings, the model would require external validation in independent datasets and sensitivity analyses to assess robustness to modeling and prior choices.

Conclusion

The Bayesian, survey-weighted, and imputed logistic regression frameworks all identified consistent predictors of diabetes risk in U.S. adults: advancing age, higher BMI, sex (lower odds for females), and non-White race/ethnicity.

The Bayesian model produced estimates nearly identical in direction and magnitude to the frequentist results while providing a more comprehensive assessment of uncertainty through posterior distributions and credible intervals.

These consistent findings across modeling frameworks underscore the robustness of core risk factors and support the use of Bayesian inference for epidemiologic research involving complex survey data.

By incorporating prior information and using MCMC to sample from the full posterior distribution, Bayesian inference enhances model transparency and interpretability. Future extensions could integrate hierarchical priors, multiple NHANES cycles, or Bayesian model averaging to better capture population heterogeneity, temporal trends, and evolving diabetes risk patterns.

References

Abdullah, H., R. Hassan, and B. Mustafa. 2022. “A Review on Bayesian Deep Learning in Healthcare: Applications and Challenges.” Artificial Intelligence in Medicine 128: 102312. https://doi.org/10.1016/j.artmed.2022.102312.
Austin, P. C., I. R. White, D. S. Lee, and S. van Buuren. 2021. “Missing Data in Clinical Research: A Tutorial on Multiple Imputation.” Canadian Journal of Cardiology 37 (9): 1322–31. https://doi.org/10.1016/j.cjca.2020.11.010.
Baldwin, S. A., and M. J. Larson. 2017. “An Introduction to Using Bayesian Linear Regression with Clinical Data.” Behaviour Research and Therapy 98: 58–75. https://doi.org/10.1016/j.brat.2017.05.014.
Buuren, S. van. 2012. Flexible Imputation of Missing Data. Boca Raton, FL: Chapman; Hall/CRC. https://doi.org/10.1201/b11826.
Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. “Mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
Chatzimichail, T., and A. T. Hatjimihail. 2023. “A Bayesian Inference-Based Computational Tool for Parametric and Nonparametric Medical Diagnosis.” Diagnostics 13 (19): 3135. https://doi.org/10.3390/diagnostics13193135.
Gelman, A., A. Jakulin, M. G. Pittau, and Y. S. Su. 2008. “A Weakly Informative Default Prior Distribution for Logistic and Other Regression Models.” Annals of Applied Statistics 2 (4): 1360–83. https://doi.org/10.1214/08-AOAS191.
Hoeting, J. A., D. Madigan, A. E. Raftery, and C. T. Volinsky. 1999. “Bayesian Model Averaging: A Tutorial.” Statistical Science 14 (4): 382–417. https://doi.org/10.1214/ss/1009212519.
Klauenberg, K., G. Wübbeler, B. Mickan, P. Harris, and C. Elster. 2015. “A Tutorial on Bayesian Normal Linear Regression.” Metrologia 52 (6): 878–92. https://doi.org/10.1088/0026-1394/52/6/878.
Kruschke, J. K., and T. M. Liddell. 2017. “Bayesian Data Analysis for Newcomers.” Psychonomic Bulletin & Review 25 (1): 155–77. https://doi.org/10.3758/s13423-017-1272-1.
Leeuw, C. de, and I. Klugkist. 2012. “Augmenting Data with Published Results in Bayesian Linear Regression.” Multivariate Behavioral Research 47 (3): 369–91. https://doi.org/10.1080/00273171.2012.673957.
Liu, Y. M., S. L. S. Chen, A. M. F. Yen, and H. H. Chen. 2013. “Individual Risk Prediction Model for Incident Cardiovascular Disease: A Bayesian Clinical Reasoning Approach.” International Journal of Cardiology 167 (5): 2008–12. https://doi.org/10.1016/j.ijcard.2012.05.016.
Luo, C., X. Sun, Y. Zhao, and H. Guo. 2024. “Variable Selection and Model Averaging in Bayesian Additive Regression Trees: A Comparative Study.” Journal of Computational and Graphical Statistics. https://doi.org/10.1080/10618600.2024.2401234.
Momeni, F., F. Momeni, A. Moradi, M. Shabani, and B. Amani. 2021. “Bayesian State-Space Modeling of Dynamic COVID-19 Risk Prediction Using Time-Varying Biomarkers.” Scientific Reports 11 (1): 20387. https://doi.org/10.1038/s41598-021-99711-7.
National Center for Health Statistics (NCHS). 2014. “National Health and Nutrition Examination Survey (NHANES) 2013–2014 Data Documentation, Codebook, and Frequencies.” U.S. Department of Health; Human Services, Centers for Disease Control; Prevention. https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/overview.aspx?BeginYear=2013.
Richardson, Robert, and Brian Hartman. 2018. “Bayesian Nonparametric Regression Models for Modeling and Predicting Healthcare Claims.” Insurance: Mathematics and Economics 83: 1–8. https://doi.org/10.1016/j.insmatheco.2018.06.002.
Schoot, Rens van de, David Kaplan, Jaap Denissen, Jens Asendorpf, Franz Neyer, and Marcel van Aken. 2013. “A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research.” European Journal of Developmental Psychology 10 (6): 723–49. https://doi.org/10.1080/17405629.2013.803373.
Vande Schoot, R., S. Depaoli, R. King, B. Kramer, K. Märtens, M. G. Tadesse, M. Vannucci, et al. 2021. “Bayesian Statistics and Modelling.” Nature Reviews Methods Primers 1: 1–26. https://doi.org/10.1038/s43586-020-00001-2.
Zeger, S. L., Z. Wu, Y. Coley, A. T. Fojo, B. Carter, K. O’Brien, P. Zandi, et al. 2020. “Using a Bayesian Approach to Predict Patients’ Health and Response to Treatment.” 272. Johns Hopkins Biostatistics Working Paper Series. https://biostats.bepress.com/jhubiostat/paper272.